Since the revolutionary draft Neanderthal genome paper https://science.sciencemag.org/content/328/5979/710, several studies have been carried out to systematically map the regions of Neanderthal introgression into modern humans. These regions were hypothesized to have multiple functional effects related to skin color, male fertility, etc.
from IPython.display import Image
Path = '/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/'
Image(Path + 'neanderthal_dna.png', width=2000)
Since both Reich and Akey made the 1000G Neanderthal introgression calls using the hg19 version of the human reference genome, we downloaded hg19 from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ and prepared it for fast sequence extraction with samtools faidx. Let us test it and print the sequence of the SLC16A11 gene, which was inherited from Neanderthals and is known to have a strong association with Type 2 Diabetes (T2D).
%%bash
cd /home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/
samtools faidx hg19.fa.gz chr17:6944941-6947411 | sed '1d' | tr -d '\n'
We can also write a simple Python script that extracts sequences of certain regions from the fasta-file. Again, we extract the SLC16A11 sequence and compare it with the sequence extracted with samtools:
import gzip
def getfasta(chr, start, end):
    Path = '/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/'
    start_reading = 0
    line_to_print = []
    with gzip.open(Path + 'hg19.fa.gz', 'rb') as fin:
        for line in fin:
            line = line.decode('utf-8').rstrip()
            if line == '>chr' + str(chr):
                start_reading = 1
            elif line.startswith('>') and start_reading == 1:
                break  # reached the next chromosome, stop reading
            elif start_reading == 1:
                line_to_print.append(line)
    print('\x1b[31m' + str(''.join(line_to_print)[start-1:end]) + '\x1b[0m')
getfasta(17, 6944941, 6947411)
getfasta(17, 6946902, 6946904)
The sequences look identical, but samtools (optimized C code) is much, much faster than my Python implementation, so in the future we will create the fasta-file containing the Neanderthal introgressed regions via samtools faidx. Now let us turn to the Neanderthal introgressed regions.
We downloaded the merged maps of Neanderthal introgression for the European + Asian 1000G populations published in Vernot and Akey, Science, 2014 from https://akeylab.princeton.edu/downloads.html. Now let us read the coordinates of the introgressed regions, extract their sequences from the hg19 human genome, and write the sequences to a new "introgression" fasta-file:
import pandas as pd
Path = '/home/nikolay/WABI/Misc/AncientDNA/NeandIntrogression/'
intr_coords = pd.read_csv(Path + '1000G_Akey_NeandIntrogression/all_haplotypes_populations.bed.all',
                          header = None, sep = "\t")
intr_coords.head()
intr_coords.shape
Let us check the distribution of lengths of the introgressed regions; this information will be important later when we select the k parameter for the k-mer analysis, where we split the sequences ("biological text") into k-mers ("biological words"):
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(20,15))
intr_lengths = intr_coords.iloc[:, 2]-intr_coords.iloc[:, 1]
sns.distplot(intr_lengths)
plt.title("Distribution of Lengths of Neandertal Introgressed Regions", fontsize = 20)
plt.xlabel("Lengths of Neandertal Introgressed Regions", fontsize = 20)
plt.ylabel("Frequency", fontsize = 20)
plt.show()
from scipy import stats
print(stats.describe(intr_lengths))
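The splitting of sequences ("biological text") into k-mers ("biological words") mentioned above can be sketched in a few lines of Python (the `to_kmers` helper is illustrative, not part of the pipeline):

```python
def to_kmers(seq, k):
    """Split a sequence into all overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(to_kmers('ACGTAC', 3))  # prints: ['ACG', 'CGT', 'GTA', 'TAC']
```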
We can see that the minimal DNA stretch of Neanderthal introgression is around 5 kbp and the maximal is close to 1 Mbp. Nevertheless, taking into account that the interbreeding occurred approximately 50,000 years ago, which is equivalent to approximately 2,000 human generations, only regions shorter than roughly 100 kbp should have survived from Neanderthals according to the estimates of David Reich https://www.nature.com/articles/nature12961. Therefore, the DNA regions in the right tail of the distribution are highly unlikely to truly come from Neanderthals and are most likely false positives, while the DNA sequences close to the mode of the distribution should be prioritized for further analysis.
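The scale of this estimate can be reproduced with a back-of-the-envelope calculation: after g generations, recombination breaks an admixed haplotype into tracts of expected length roughly 1/(g·r), where r is the recombination rate per bp per generation. The values below (r ≈ 1e-8, i.e. ~1 cM/Mbp, and 25 years per generation) are illustrative assumptions, not Reich's actual model:

```python
# Rough sketch of the expected surviving tract length after admixture.
years_since_admixture = 50_000
generation_time = 25              # years per generation (assumption)
r = 1e-8                          # recombination rate per bp per generation (~1 cM/Mbp)

g = years_since_admixture / generation_time   # ~2000 generations
expected_tract_bp = 1 / (g * r)
print(f'{expected_tract_bp / 1e3:.0f} kbp')   # prints: 50 kbp
```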
import os
import subprocess
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/')
a = 0
with open('hg19_introgr_regions.fa', 'a') as fp:
    for i in range(intr_coords.shape[0]):
        coord = str(intr_coords.iloc[i, 0]) + ':' \
            + str(intr_coords.iloc[i, 1]) + '-' + str(intr_coords.iloc[i, 2])
        subprocess.run(['samtools', 'faidx', 'hg19.fa.gz', coord], stdout = fp)
        a = a + 1
        if a%1000 == 0:
            print('Finished ' + str(a) + ' Neanderthal introgressed haplotypes')
It is interesting to try a simple k-mer analysis with KAT; we downloaded and installed KAT from https://kat.readthedocs.io/en/latest/index.html and ran it using the following command lines:
# kat hist hg19_introgr_regions.fa
# kat plot spectra-hist -x 500 kat.hist
'''
Kmer Analysis Toolkit (KAT) V2.4.2
Running KAT in HIST mode
------------------------
Input 1 is a sequence file. Counting kmers for input 1 (hg19_introgr_regions.fa) ...
Warning: Specified hash size insufficent - attempting to double hash size... success!
Warning: Specified hash size insufficent - attempting to double hash size... success!
Warning: Specified hash size insufficent - attempting to double hash size... success!
done. Time taken: 517.0s
Bining kmers ... done. Time taken: 51.8s
Merging counts ... done. Time taken: 0.0s
Saving results to disk ... done. Time taken: 0.0s
Creating plot ... done. Time taken: 1.4s
Analysing peaks
---------------
Analysing distributions for: kat.hist ... done. Time taken: 3.7s
K-mer frequency spectra statistics
----------------------------------
K-value used: 27
Peaks in analysis: 3
Global minima @ Frequency=77x (1054)
Global maxima @ Frequency=79x (1085)
Overall mean k-mer frequency: 163x
Index Left Mean Right StdDev Max Volume Description
------- ------ ------ ------- -------- ----- -------- -------------
1 14.41 79 143.59 32.3 306 24648 1X
2 17.66 156 294.34 69.17 234 40154 2X
3 30.04 395 759.96 182.48 31 10257 5X
Calculating genome statistics
-----------------------------
Assuming that homozygous peak is the largest in the spectra with frequency of: 79x
Homozygous peak index: 1
CAUTION: the following estimates are based on having a clean spectra and having identified
the correct homozygous peak!
Estimated genome size: 0.14 Mbp
Creating plots
--------------
Plotting K-mer frequency distributions ... done. Saved to: None
KAT HIST completed.
Total runtime: 574.5s
'''
from IPython.display import Image
Path = '/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/'
Image(Path + 'kat_intr-spectra-hist.png', width=2000)
Now we need a set of control sequences, i.e. a set of regions of identical lengths extracted from regions depleted of Neanderthal ancestry.
intr_coords.head()
list(intr_lengths)[0:10]
chr_sizes = pd.read_csv("hg19.fa.gz.fai", header=None, sep="\t")
chr_sizes.head()
import numpy as np
i = 0
chr_df = intr_coords[intr_coords[0] == intr_coords[0][i]]
overlap = True
while overlap == True:
    reg_start = np.random.randint(1, int(chr_sizes[chr_sizes[0] == intr_coords[0][i]].iloc[0, 1]))
    reg_end = reg_start + intr_lengths[i]
    print(reg_start)
    for j in range(chr_df.shape[0]):
        b1 = chr_df.iloc[j, 1]
        b2 = chr_df.iloc[j, 2]
        if (reg_start > b1 and reg_start < b2) or (reg_end > b1 and reg_end < b2) \
                or (reg_start <= b1 and reg_end >= b2):  # also catch full containment
            print(reg_start, reg_end, b1, b2)
            overlap = True
            break
        else:
            overlap = False
print(intr_coords[0][i], reg_start, reg_end)
import numpy as np
chr_list = []
start_list = []
end_list = []
for i in range(intr_coords.shape[0]):
    chr_df = intr_coords[intr_coords[0] == intr_coords[0][i]]
    overlap = True
    while overlap == True:
        reg_start = np.random.randint(1, int(chr_sizes[chr_sizes[0] == intr_coords[0][i]].iloc[0, 1]))
        reg_end = reg_start + intr_lengths[i]
        # compare the random region against every introgressed region on this chromosome
        for j in range(chr_df.shape[0]):
            b1 = chr_df.iloc[j, 1]
            b2 = chr_df.iloc[j, 2]
            if (reg_start > b1 and reg_start < b2) or (reg_end > b1 and reg_end < b2) \
                    or (reg_start <= b1 and reg_end >= b2):  # also catch full containment
                overlap = True
                break
            else:
                overlap = False
    chr_list.append(intr_coords[0][i])
    start_list.append(reg_start)
    end_list.append(reg_end)
depl_coords = pd.DataFrame({'0': chr_list, '1': start_list, '2': end_list})
depl_coords.head()
depl_coords.shape
intr_coords.head()
depl_lengths = depl_coords.iloc[:, 2] - depl_coords.iloc[:, 1]
list(intr_lengths)[0:10]
list(depl_lengths)[0:10]
import os
import subprocess
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/')
with open('hg19_depl_regions.fa', 'a') as fp:
    for i in range(depl_coords.shape[0]):
        coord = str(depl_coords.iloc[i, 0]) + ':' \
            + str(depl_coords.iloc[i, 1]) + '-' + str(depl_coords.iloc[i, 2])
        subprocess.run(['samtools', 'faidx', 'hg19.fa.gz', coord], stdout = fp)
# kat hist hg19_depl_regions.fa
'''
Kmer Analysis Toolkit (KAT) V2.4.2
Running KAT in HIST mode
------------------------
Input 1 is a sequence file. Counting kmers for input 1 (hg19_depl_regions.fa) ...
Warning: Specified hash size insufficent - attempting to double hash size... success!
Warning: Specified hash size insufficent - attempting to double hash size... success!
done. Time taken: 303.3s
Bining kmers ... done. Time taken: 61.1s
Merging counts ... done. Time taken: 0.0s
Saving results to disk ... done. Time taken: 0.0s
Creating plot ... done. Time taken: 1.3s
Analysing peaks
---------------
Analysing distributions for: kat.hist ... done. Time taken: 3.9s
K-mer frequency spectra statistics
----------------------------------
K-value used: 27
Peaks in analysis: 3
Global minima @ Frequency=68x (1358)
Global maxima @ Frequency=70x (1372)
Overall mean k-mer frequency: 146x
Index Left Mean Right StdDev Max Volume Description
------- ------ ------ ------- -------- ----- -------- -------------
1 13.66 70 126.34 28.17 406 28539 1X
2 16.61 138 259.39 60.69 294 44290 2X
3 28.38 350 671.62 160.81 38 12552 5X
Calculating genome statistics
-----------------------------
Assuming that homozygous peak is the largest in the spectra with frequency of: 70x
Homozygous peak index: 1
CAUTION: the following estimates are based on having a clean spectra and having identified the correct homozygous peak!
Estimated genome size: 0.15 Mbp
Creating plots
--------------
Plotting K-mer frequency distributions ... done. Saved to: None
KAT HIST completed.
Total runtime: 370.1s
'''
from IPython.display import Image
Path = '/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/'
Image(Path + 'kat_depl-spectra-hist.png', width=2000)
We do not see any obviously different behavior in the k-mer spectra between the Neanderthal introgressed and depleted regions. Let us plot the two spectra against each other in order to see whether there are any differences:
import numpy as np
import matplotlib.pyplot as plt
intr_spectr = pd.read_csv('kat_intr.hist', header = None, comment = "#", sep = " ")
depl_spectr = pd.read_csv('kat_depl.hist', header = None, comment = "#", sep = " ")
plt.figure(figsize=(20,15))
plt.plot(intr_spectr[0], intr_spectr[1], c = 'blue')
plt.plot(depl_spectr[0], depl_spectr[1], c = 'red')
plt.xlim(70, 400)
plt.ylim(0, 1000)
plt.title('K-mer Spectra of Neanderthal Introgressed and Depleted Regions', fontsize = 20)
plt.xlabel('K-mer Multiplicity', fontsize = 20)
plt.ylabel('Frequency', fontsize = 20)
plt.show()
So there is no obvious difference between the Neanderthal introgressed and depleted regions. However, one problem we have here is the high proportion of regions containing long stretches of N nucleotides, meaning that sequencing or genome assembly failed in these regions; they might be repetitive regions. Here we count how many N-containing lines we have in the fasta-files.
!grep -c N hg19_introgr_regions.fa
!grep -c N hg19_depl_regions.fa
Now we are going to clean the two fasta-files, i.e. the Neanderthal introgressed and depleted regions, of entries containing N nucleotides. We could have selected only N-free regions from the very beginning, but since the damage is already done, for simplicity we now just remove the corresponding entries from both fasta-files if at least one of them contains N nucleotides. In other words, if e.g. the introgressed fasta-file contains a 5 kbp stretch of nucleotides where at least one of them is N, then we delete the whole stretch in the introgressed fasta-file and the corresponding 5 kbp stretch in the depleted fasta-file.
from Bio import SeqIO
i = 0
for record in SeqIO.parse('hg19_introgr_regions.fa', 'fasta'):
    upper_record = record.seq.upper()
    if 'N' in upper_record:
        i = i + 1
print(i)
i = 0
for record in SeqIO.parse('hg19_depl_regions.fa', 'fasta'):
    upper_record = record.seq.upper()
    if 'N' in upper_record:
        i = i + 1
print(i)
It looks like the introgressed fasta-file has 11 entries containing N nucleotides while the depleted fasta-file has 505 such entries. If those entries do not overlap, we will remove 516 entries from both fasta-files, convert the nucleotides to upper case in parallel, and write the clean fasta-files to new files. We are going to convert the remaining 6721 - 516 = 6205 entries into k-mers, which means we will have plenty of training data for running Convolutional and Recurrent Neural Networks (CNN / RNN).
from Bio import SeqIO
for record in SeqIO.parse('hg19_introgr_regions.fa', 'fasta'):
    upper_record = record.seq.upper()
    if 'N' in upper_record:
        print(record.id)
from Bio import SeqIO
intr_file = 'hg19_introgr_regions.fa'
depl_file = 'hg19_depl_regions.fa'
a = 0
i = 0
with open('hg19_introgr_clean.fa', 'a') as intr_out, open('hg19_depl_clean.fa', 'a') as depl_out:
    for intr, depl in zip(SeqIO.parse(intr_file, 'fasta'), SeqIO.parse(depl_file, 'fasta')):
        upper_intr = intr.seq.upper()
        upper_depl = depl.seq.upper()
        a = a + 1
        if a%1000 == 0:
            print('Finished ' + str(a) + ' entries')
        if 'N' not in str(upper_intr) and 'N' not in str(upper_depl):
            intr.seq = upper_intr
            SeqIO.write(intr, intr_out, 'fasta')
            depl.seq = upper_depl
            SeqIO.write(depl, depl_out, 'fasta')
            i = i + 1
        else:
            continue
print('We have processed ' + str(a) + ' entries and written ' + str(i) + ' entries to two fasta-files')
Let us make sure that there are no N-nucleotides present in the clean fasta-files:
!grep -c N hg19_introgr_clean.fa
!grep -c N hg19_depl_clean.fa
Now it is time to build the Neanderthal introgressed vs. depleted sequence data set for input to the Convolutional Neural Network (CNN). For simplicity we will make all sequences the same length by truncating them; the minimal length of an introgressed DNA stretch is 5251 bp, but in the code below we keep only the first 32 bp of each sequence.
import os
from Bio import SeqIO
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/')
intr_file = 'hg19_introgr_clean.fa'
depl_file = 'hg19_depl_clean.fa'
a = 0
intr_seqs = []
depl_seqs = []
for intr, depl in zip(SeqIO.parse(intr_file, 'fasta'), SeqIO.parse(depl_file, 'fasta')):
    #intr_seqs.append(str(intr.seq)[0:np.min(depl_lengths)])
    #depl_seqs.append(str(depl.seq)[0:np.min(depl_lengths)])
    s_intr = str(intr.seq)[0:32]
    s_depl = str(depl.seq)[0:32]
    if s_intr.count('A')>0 and s_intr.count('C')>0 and s_intr.count('G')>0 and s_intr.count('T')>0 and \
       s_depl.count('A')>0 and s_depl.count('C')>0 and s_depl.count('G')>0 and s_depl.count('T')>0:
        intr_seqs.append(s_intr)
        depl_seqs.append(s_depl)
    a = a + 1
    if a%1000 == 0:
        print('Finished ' + str(a) + ' entries')
sequences = intr_seqs + depl_seqs
len(sequences)
import numpy as np
labels = list(np.ones(len(intr_seqs))) + list(np.zeros(len(depl_seqs)))
len(labels)
The next step is to organize the data into a format that can be passed into a deep learning algorithm. Most deep learning algorithms accept data in the form of vectors or matrices (or more generally, tensors). To get each DNA sequence in the form of a matrix, we use one-hot encoding, which encodes every base in a sequence in the form of a 4-dimensional vector, with a separate dimension for each base. We place a "1" in the dimension corresponding to the base found in the DNA sequence, and "0"s in all other slots. We then concatenate these 4-dimensional vectors together along the bases in the sequence to form a matrix.
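This encoding can be illustrated with a few lines of plain numpy (the `one_hot` helper and its fixed A/C/G/T column order are illustrative assumptions, not part of the pipeline, which uses scikit-learn encoders):

```python
import numpy as np

# Illustrative sketch: one-hot encode a DNA sequence with a fixed
# A/C/G/T column order (one column per base).
BASE_INDEX = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def one_hot(seq):
    mat = np.zeros((len(seq), 4), dtype=int)
    for i, base in enumerate(seq):
        mat[i, BASE_INDEX[base]] = 1  # "1" in the dimension of the observed base
    return mat

print(one_hot('ACGT'))  # one row per base, a single 1 per row
```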
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import warnings
warnings.filterwarnings('ignore')
# The LabelEncoder encodes a sequence of bases as a sequence of integers: 0, 1, 2 and 3
integer_encoder = LabelEncoder()
# The OneHotEncoder converts an array of integers to a sparse matrix where
# each row corresponds to one possible value of each feature, i.e. only 0 and 1 are present in the matrix
one_hot_encoder = OneHotEncoder()
input_features = []
for sequence in sequences:
    integer_encoded = integer_encoder.fit_transform(list(sequence))
    integer_encoded = np.array(integer_encoded).reshape(-1, 1)
    one_hot_encoded = one_hot_encoder.fit_transform(integer_encoded)
    input_features.append(one_hot_encoded.toarray())
np.set_printoptions(threshold = 40)
#print(input_features.shape)
input_features = np.stack(input_features)
print("Example sequence\n-----------------------")
print('DNA Sequence #1:\n',sequences[0][:10],'...',sequences[0][-10:])
print('One hot encoding of Sequence #1:\n',input_features[0].T)
one_hot_encoder = OneHotEncoder()
labels = np.array(labels).reshape(-1, 1)
input_labels = one_hot_encoder.fit_transform(labels).toarray()
print('Labels:\n',labels.T)
print('One-hot encoded labels:\n',input_labels.T)
Now we are going to split the data set into train and test data sets:
from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split(
    input_features, input_labels, test_size = 0.2, random_state = 42)
train_features.shape
train_labels.shape
test_features.shape
test_labels.shape
Now everything is ready for the classification. Here we choose a simple 1D Convolutional Neural Network (CNN), which is commonly used in deep learning for functional genomics applications.
A CNN learns to recognize patterns that are generally invariant across space by trying to match the input sequence against a number of learnable "filters" of a fixed size. In our dataset, the filters will be motifs within the DNA sequences. The CNN may then learn to combine these filters to recognize a larger structure (e.g. the presence or absence of a Neanderthal introgression site in a sequence). We will start by defining the CNN model and summarizing its fitting parameters.
from keras.optimizers import SGD, Adam, Adadelta
from keras.layers import Conv1D, Dense, MaxPooling1D, Flatten, Dropout, Embedding, Activation
from keras.models import Sequential
from keras.regularizers import l1
import warnings
warnings.filterwarnings('ignore')
model = Sequential()
model.add(Conv1D(filters = 16, kernel_size = 5, padding = 'same', kernel_initializer = 'he_uniform',
                 input_shape = (train_features.shape[1], 4)))
model.add(Activation("relu"))
model.add(Conv1D(filters = 16, kernel_size = 5, padding = 'same', kernel_initializer = 'he_uniform'))
model.add(Activation("relu"))
model.add(MaxPooling1D(pool_size = 2))
model.add(Dropout(0.3))
model.add(Flatten())
model.add(Dense(8, kernel_initializer = 'he_uniform'))
model.add(Activation("relu"))
model.add(Dropout(0.5))
model.add(Dense(2, activation = 'softmax'))
epochs = 100
lrate = 0.01
decay = lrate / epochs
sgd = SGD(lr = lrate, momentum = 0.9, decay = decay, nesterov = False)
model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['binary_accuracy'])
#model.compile(loss='binary_crossentropy', optimizer=Adam(lr = lrate), metrics=['binary_accuracy'])
model.summary()
Now, we are ready to go ahead and train the neural network. We will further divide the training set into a training and validation set. We will train only on the reduced training set, but plot the loss curve on both the training and validation sets. Once the loss for the validation set stops improving or gets worse throughout the learning cycles, it is time to stop training because the model has already converged and may be just overfitting.
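The stopping criterion described here can be sketched as a small helper (the `should_stop` function is hypothetical and not used in the notebook, which instead trains for a fixed 100 epochs):

```python
def should_stop(val_losses, patience=5):
    """Return True when the best validation loss lies more than
    `patience` epochs in the past, i.e. the loss stopped improving."""
    if len(val_losses) <= patience:
        return False
    best_epoch = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return len(val_losses) - 1 - best_epoch >= patience

print(should_stop([1.0, 0.9, 0.8], patience=2))            # prints: False
print(should_stop([1.0, 0.5, 0.6, 0.7, 0.8], patience=3))  # prints: True
```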
import warnings
warnings.filterwarnings('ignore')
history = model.fit(train_features, train_labels,
                    epochs = epochs, verbose = 1, validation_split = 0.2, batch_size = 32, shuffle = True)
import matplotlib.pyplot as plt
plt.figure(figsize=(20,15))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss', fontsize = 20)
plt.ylabel('Loss', fontsize = 20)
plt.xlabel('Epoch', fontsize = 20)
plt.legend(['Train', 'Validation'], fontsize = 20)
plt.show()
plt.figure(figsize=(20,15))
plt.plot(history.history['binary_accuracy'])
plt.plot(history.history['val_binary_accuracy'])
plt.title('Model Accuracy', fontsize = 20)
plt.ylabel('Accuracy', fontsize = 20)
plt.xlabel('Epoch', fontsize = 20)
plt.legend(['Train', 'Validation'], fontsize = 20)
plt.show()
The best way to evaluate whether the network has learned to classify sequences is to evaluate its performance on a fresh test set consisting of data that it has not observed at all during training. Here, we evaluate the model on the test set and plot the results as a confusion matrix.
from sklearn.metrics import confusion_matrix
import itertools
plt.figure(figsize=(15,10))
predicted_labels = model.predict(np.stack(test_features))
cm = confusion_matrix(np.argmax(test_labels, axis=1),
                      np.argmax(predicted_labels, axis=1))
print('Confusion matrix:\n',cm)
cm = cm.astype('float') / cm.sum(axis = 1)[:, np.newaxis]
plt.imshow(cm, cmap = plt.cm.Blues)
plt.title('Normalized confusion matrix', fontsize = 20)
plt.colorbar()
plt.xlabel('Predicted label', fontsize = 20)
plt.ylabel('True label', fontsize = 20)
#plt.xticks([0, 1]); plt.yticks([0, 1])
#plt.grid('off')
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
    plt.text(j, i, format(cm[i, j], '.2f'),
             horizontalalignment = 'center', verticalalignment = 'center', fontsize = 20,
             color = 'white' if cm[i, j] > 0.5 else 'black')
plt.show()
scores = model.evaluate(test_features, test_labels, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
The accuracy is not fantastic; apparently there is not enough signal in the data, i.e. the Neanderthal introgressed regions look very similar, in terms of their sequence, to the regions with reduced Neanderthal ancestry. This can have many reasons, both biological and technical. One technical artifact could be poor annotation of the Neanderthal introgressed regions by Vernot and Akey; we could use David Reich's annotation as an alternative.
Let us now take a step back and simplify the problem. Let us address something simple that should be solvable within the current framework. For example, we can try to classify sequences as belonging either to genes or to intergenic regions.
Here we will follow roughly the same lines as for the Neanderthal introgressed vs. depleted sequence classification and perform a simple "sentiment analysis", i.e. a simple gene vs. non-gene sequence classification. This should be an easy task, as genes have start and stop codons which should be recognized and learned by the CNN quite easily. We will start by building an annotation file for protein-coding genes. We downloaded the RefSeq annotation as a text-file from http://genome.ucsc.edu/cgi-bin/hgTables and used the "genePredToGtf" tool to build the refGene_hg19.gtf annotation file.
!cut -f9 refGene_hg19.gtf | cut -d ';' -f1 | cut -d ' ' -f2 | uniq | tr -d '\"' > refseq_gene_list_hg19.txt
%load_ext rpy2.ipython
%%R
library("biomaRt")
ensembl <- useMart("ensembl",host="grch37.ensembl.org")
ensembl <- useDataset("hsapiens_gene_ensembl",mart=ensembl)
refseq_genes <- scan("refseq_gene_list_hg19.txt", what = "character")
output <- getBM(attributes = c('hgnc_symbol', 'chromosome_name', 'start_position', 'end_position', 'strand'),
                filters = c('hgnc_symbol'), values = refseq_genes, mart = ensembl)
head(output)
%%R
output_clean <- output[as.character(output$chromosome_name)%in%c(as.character(seq(1:22)),"X","Y"),]
dim(output_clean)
%%R
head(output_clean)
%%R
output_to_write <- output_clean
output_to_write$hgnc_symbol <- NULL
output_to_write$strand <- NULL
output_to_write$chromosome_name <- paste0("chr",output_to_write$chromosome_name)
write.table(output_to_write, file="gene_coords.txt",col.names=FALSE,row.names=FALSE,quote=FALSE,sep="\t")
import pandas as pd
Path = '/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/'
gene_coords = pd.read_csv(Path + 'gene_coords.txt', header = None, sep = "\t")
gene_coords.head()
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(20,15))
gene_lengths = gene_coords.iloc[:, 2]-gene_coords.iloc[:, 1]
sns.distplot(gene_lengths)
plt.title("Distribution of Gene Lengths", fontsize = 20)
plt.xlabel("Lengths of Genes", fontsize = 20)
plt.ylabel("Frequency", fontsize = 20)
plt.show()
from scipy import stats
print(stats.describe(gene_lengths))
import os
import subprocess
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/')
with open('hg19_gene_regions.fa', 'a') as fp:
    for i in range(gene_coords.shape[0]):
        coord = str(gene_coords.iloc[i, 0]) + ':' \
            + str(gene_coords.iloc[i, 1]) + '-' + str(gene_coords.iloc[i, 2])
        subprocess.run(['samtools', 'faidx', 'hg19.fa.gz', coord], stdout = fp)
chr_sizes = pd.read_csv("hg19.fa.gz.fai", header = None, sep = "\t")
chr_sizes.head()
chr_list = []
start_list = []
end_list = []
for i in range(gene_coords.shape[0]):
    chr_df = gene_coords[gene_coords[0] == gene_coords[0][i]]
    overlap = True
    while overlap == True:
        reg_start = np.random.randint(1, int(chr_sizes[chr_sizes[0] == gene_coords[0][i]].iloc[0, 1]))
        reg_end = reg_start + gene_lengths[i]
        # compare the random region against every gene on this chromosome
        for j in range(chr_df.shape[0]):
            b1 = chr_df.iloc[j, 1]
            b2 = chr_df.iloc[j, 2]
            if (reg_start > b1 and reg_start < b2) or (reg_end > b1 and reg_end < b2) \
                    or (reg_start <= b1 and reg_end >= b2):  # also catch full containment
                overlap = True
                break
            else:
                overlap = False
    chr_list.append(gene_coords[0][i])
    start_list.append(reg_start)
    end_list.append(reg_end)
notgene_coords = pd.DataFrame({'0': chr_list, '1': start_list, '2': end_list})
notgene_coords.to_csv("notgene_coords.txt", index = False, header = False, sep = "\t")
notgene_coords.head()
import os
import subprocess
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/')
with open('hg19_notgene_regions.fa', 'a') as fp:
    for i in range(notgene_coords.shape[0]):
        coord = str(notgene_coords.iloc[i, 0]) + ':' \
            + str(notgene_coords.iloc[i, 1]) + '-' + str(notgene_coords.iloc[i, 2])
        subprocess.run(['samtools', 'faidx', 'hg19.fa.gz', coord], stdout = fp)
!grep -c N hg19_gene_regions.fa
!grep -c N hg19_notgene_regions.fa
from Bio import SeqIO
i = 0
for record in SeqIO.parse('hg19_gene_regions.fa', 'fasta'):
    upper_record = record.seq.upper()
    if 'N' in upper_record:
        i = i + 1
print(i)
i = 0
for record in SeqIO.parse('hg19_notgene_regions.fa', 'fasta'):
    upper_record = record.seq.upper()
    if 'N' in upper_record:
        i = i + 1
print(i)
from Bio import SeqIO
gene_file = 'hg19_gene_regions.fa'
notgene_file = 'hg19_notgene_regions.fa'
a = 0
i = 0
with open('hg19_gene_clean.fa', 'a') as gene_out, open('hg19_notgene_clean.fa', 'a') as notgene_out:
    for gene, notgene in zip(SeqIO.parse(gene_file, 'fasta'), SeqIO.parse(notgene_file, 'fasta')):
        upper_gene = gene.seq.upper()
        upper_notgene = notgene.seq.upper()
        a = a + 1
        if a%1000 == 0:
            print('Finished ' + str(a) + ' entries')
        if 'N' not in str(upper_gene) and 'N' not in str(upper_notgene):
            gene.seq = upper_gene
            SeqIO.write(gene, gene_out, 'fasta')
            notgene.seq = upper_notgene
            SeqIO.write(notgene, notgene_out, 'fasta')
            i = i + 1
        else:
            continue
print('We have processed ' + str(a) + ' entries and written ' + str(i) + ' entries to two fasta-files')
!grep -c N hg19_gene_clean.fa
!grep -c N hg19_notgene_clean.fa
import os
from Bio import SeqIO
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/')
gene_file = 'hg19_gene_clean.fa'
notgene_file = 'hg19_notgene_clean.fa'
a = 0
gene_seqs = []
notgene_seqs = []
for gene, notgene in zip(SeqIO.parse(gene_file, 'fasta'), SeqIO.parse(notgene_file, 'fasta')):
    #gene_seqs.append(str(gene.seq)[0:np.min(gene_lengths)])
    #notgene_seqs.append(str(notgene.seq)[0:np.min(notgene_lengths)])
    s_gene = str(gene.seq)[0:32]
    s_notgene = str(notgene.seq)[0:32]
    if s_gene.count('A')>0 and s_gene.count('C')>0 and s_gene.count('G')>0 and s_gene.count('T')>0 and \
       s_notgene.count('A')>0 and s_notgene.count('C')>0 and s_notgene.count('G')>0 and s_notgene.count('T')>0:
        gene_seqs.append(s_gene)
        notgene_seqs.append(s_notgene)
    a = a + 1
    if a%1000 == 0:
        print('Finished ' + str(a) + ' entries')
sequences = gene_seqs + notgene_seqs
len(sequences)
import numpy as np
labels = list(np.ones(len(gene_seqs))) + list(np.zeros(len(notgene_seqs)))
len(labels)
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import warnings
warnings.filterwarnings('ignore')
# The LabelEncoder encodes a sequence of bases as a sequence of integers: 0, 1, 2 and 3
integer_encoder = LabelEncoder()
# The OneHotEncoder converts an array of integers to a sparse matrix where
# each row corresponds to one possible value of each feature, i.e. only 0 and 1 are present in the matrix
one_hot_encoder = OneHotEncoder()
input_features = []
for sequence in sequences:
    integer_encoded = integer_encoder.fit_transform(list(sequence))
    integer_encoded = np.array(integer_encoded).reshape(-1, 1)
    one_hot_encoded = one_hot_encoder.fit_transform(integer_encoded)
    input_features.append(one_hot_encoded.toarray())
np.set_printoptions(threshold = 40)
#print(input_features.shape)
input_features = np.stack(input_features)
print("Example sequence\n-----------------------")
print('DNA Sequence #1:\n',sequences[0][:10],'...',sequences[0][-10:])
print('One hot encoding of Sequence #1:\n',input_features[0].T)
one_hot_encoder = OneHotEncoder()
labels = np.array(labels).reshape(-1, 1)
input_labels = one_hot_encoder.fit_transform(labels).toarray()
print('Labels:\n',labels.T)
print('One-hot encoded labels:\n',input_labels.T)
from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split(
    input_features, input_labels, test_size = 0.2, random_state = 42)
train_features.shape
train_labels.shape
test_features.shape
test_labels.shape
from keras.optimizers import SGD, Adam, Adadelta
from keras.layers import Conv1D, Dense, MaxPooling1D, Flatten, Dropout, Embedding, Activation
from keras.models import Sequential
from keras.regularizers import l1
import warnings
warnings.filterwarnings('ignore')
model = Sequential()
model.add(Conv1D(filters = 16, kernel_size = 5, padding = 'same', kernel_initializer = 'he_uniform',
                 input_shape = (train_features.shape[1], 4)))
model.add(Activation("relu"))
model.add(Conv1D(filters = 16, kernel_size = 5, padding = 'same', kernel_initializer = 'he_uniform'))
model.add(Activation("relu"))
model.add(MaxPooling1D(pool_size = 2))
model.add(Dropout(0.3))
model.add(Flatten())
model.add(Dense(8, kernel_initializer = 'he_uniform'))
model.add(Activation("relu"))
model.add(Dropout(0.5))
model.add(Dense(2, activation = 'softmax'))
epochs = 100
lrate = 0.01
decay = lrate / epochs
sgd = SGD(lr = lrate, momentum = 0.9, decay = decay, nesterov = False)
model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['binary_accuracy'])
#model.compile(loss='binary_crossentropy', optimizer=Adam(lr = lrate), metrics=['binary_accuracy'])
model.summary()
import warnings
warnings.filterwarnings('ignore')
history = model.fit(train_features, train_labels,
epochs = epochs, verbose = 1, validation_split = 0.2, batch_size = 32, shuffle = True)
import matplotlib.pyplot as plt
plt.figure(figsize=(20,15))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss', fontsize = 20)
plt.ylabel('Loss', fontsize = 20)
plt.xlabel('Epoch', fontsize = 20)
plt.legend(['Train', 'Validation'], fontsize = 20)
plt.show()
plt.figure(figsize=(20,15))
plt.plot(history.history['binary_accuracy'])
plt.plot(history.history['val_binary_accuracy'])
plt.title('Model Accuracy', fontsize = 20)
plt.ylabel('Accuracy', fontsize = 20)
plt.xlabel('Epoch', fontsize = 20)
plt.legend(['Train', 'Validation'], fontsize = 20)
plt.show()
from sklearn.metrics import confusion_matrix
import itertools
plt.figure(figsize=(15,10))
predicted_labels = model.predict(np.stack(test_features))
cm = confusion_matrix(np.argmax(test_labels, axis=1),
np.argmax(predicted_labels, axis=1))
print('Confusion matrix:\n',cm)
cm = cm.astype('float') / cm.sum(axis = 1)[:, np.newaxis]
plt.imshow(cm, cmap = plt.cm.Blues)
plt.title('Normalized confusion matrix', fontsize = 20)
plt.colorbar()
plt.xlabel('Predicted label', fontsize = 20)
plt.ylabel('True label', fontsize = 20)
#plt.xticks([0, 1]); plt.yticks([0, 1])
#plt.grid('off')
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
plt.text(j, i, format(cm[i, j], '.2f'),
horizontalalignment = 'center', verticalalignment = 'center', fontsize = 20,
color='white' if cm[i, j] > 0.5 else 'black')
plt.show()
scores = model.evaluate(test_features, test_labels, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
import keras.backend as K
def compute_salient_bases(model, x):
input_tensors = [model.input]
gradients = model.optimizer.get_gradients(model.output[0][1], model.input)
compute_gradients = K.function(inputs = input_tensors, outputs = gradients)
x_value = np.expand_dims(x, axis=0)
gradients = compute_gradients([x_value])[0][0]
sal = np.clip(np.sum(np.multiply(gradients,x), axis=1),a_min=0, a_max=None)
return sal
sequence_index = 12
K.set_learning_phase(1) #set learning phase
sal = compute_salient_bases(model, input_features[sequence_index])
plt.figure(figsize=[16,5])
barlist = plt.bar(np.arange(len(sal)), sal)
[barlist[i].set_color('C1') for i in range(0,6)]
[barlist[j].set_color('C1') for j in range(26,32)]
plt.xlabel('Bases')
plt.ylabel('Magnitude of saliency values')
plt.xticks(np.arange(len(sal)), list(sequences[sequence_index]));
plt.title('Saliency map for bases in one of the ancient sequences'
' (orange indicates the informative bases in motif)');
plt.show()
This looks much better! The model can distinguish between segments coming from genes and non-genes with close to 80% accuracy, which can probably be improved by considering longer stretches of DNA. So one reason this kind of classification did not work for Neanderthal introgressed vs. depleted sequences might be the poor calling / annotation provided by Vernot and Akey in their Science 2014 paper. As evidence that something is wrong with the introgressed sequence calls from Vernot and Akey 2014, we can calculate how many times their sequences overlap the RefSeq genes:
%%bash
Path=/home/nikolay/WABI/Misc/AncientDNA/NeandIntrogression/1000G_Akey_NeandIntrogression
bedtools intersect -a ./genes/gene_coords.txt -b $Path/all_haplotypes_populations.bed.all | wc -l
Now let us compare this number with how many times their sequences overlap intergenic regions:
%%bash
Path=/home/nikolay/WABI/Misc/AncientDNA/NeandIntrogression/1000G_Akey_NeandIntrogression
bedtools intersect -a ./genes/notgene_coords.txt -b $Path/all_haplotypes_populations.bed.all | wc -l
Here we see that there is no enrichment of the sequences in either gene or intergenic regions. If anything, the Neanderthal introgressed sequences seem to be slightly enriched in the gene regions, which contradicts one of the main conclusions of David Reich, Nature 2014, where they explicitly state that Neanderthal sequences seem to be depleted from gene regions; in an evolutionary context this can be viewed as natural selection acting against Neanderthal admixture for the sake of fitness. Now we will try to use the better annotation from Vernot and Akey, Science 2016, and the annotation from David Reich, Nature 2016, and redo the Neanderthal introgressed vs. depleted sequence classification.
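Beyond comparing raw overlap counts, one can ask whether the gene vs. intergenic difference is statistically meaningful with a 2x2 Fisher exact test. A minimal sketch with purely hypothetical counts (the actual `wc -l` outputs from the cells above should be substituted):

```python
from scipy.stats import fisher_exact

# purely hypothetical 2x2 overlap counts (rows: introgressed vs. background;
# columns: overlaps a gene region vs. overlaps an intergenic region)
table = [[12, 13],
         [15, 10]]
odds, p = fisher_exact(table)
# odds < 1 would indicate relative depletion of introgressed overlaps in genes
```

`fisher_exact` returns the sample odds ratio `(a*d)/(b*c)` together with the two-sided p-value; with genome-scale counts a permutation test like the one below is often the more honest choice, since overlapping intervals violate the independence assumption of the Fisher test.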
We downloaded the merged maps of Neanderthal introgression for the European + Asian 1000G populations published in Vernot and Akey, Science, 2014 from https://akeylab.princeton.edu/downloads.html. However, the data looked strange (the segments looked too different from the ones from Reich, Nature 2014), and we could not establish any good model distinguishing between Neanderthal introgressed and depleted regions. Next we downloaded the recalled data from Vernot et al., Science 2016, from https://drive.google.com/drive/folders/0B9Pc7_zItMCVWUp6bWtXc2xJVkk. Note that here we obtain the opposite result when intersecting the introgressed regions with gene and non-gene regions.
%%bash
Path=/home/nikolay/WABI/Misc/AncientDNA/NeandIntrogression
bedtools intersect -a ./genes/gene_coords.txt -b $Path/Akey_intr_coords.bed | wc -l
%%bash
Path=/home/nikolay/WABI/Misc/AncientDNA/NeandIntrogression
bedtools intersect -a ./genes/notgene_coords.txt -b $Path/Akey_intr_coords.bed | wc -l
Here it is clear that the Neanderthal introgression regions overlap more often with the intergenic regions, i.e. the genes are depleted of Neanderthal introgression, suggesting an evolutionary loss in fitness from interbreeding of modern humans with Neanderthals. We can see the same effect with the Neanderthal introgressed regions from David Reich, 2016, after selecting the most confident regions, i.e. those with probability >= 90%. Note that Reich and colleagues provide candidate regions with probability >= 50%, but to be on the safe side we select only the most confident intervals.
%%bash
Path=/home/nikolay/WABI/Misc/AncientDNA/NeandIntrogression
bedtools intersect -a ./genes/gene_coords.txt -b $Path/Reich_intr_coords_high_conf.bed | wc -l
%%bash
Path=/home/nikolay/WABI/Misc/AncientDNA/NeandIntrogression
bedtools intersect -a ./genes/notgene_coords.txt -b $Path/Reich_intr_coords_high_conf.bed | wc -l
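For reference, the `bedtools intersect ... | wc -l` counting used above can be mimicked in pure Python. A naive sketch, assuming 3-column (chrom, start, end) interval lists with half-open BED coordinates:

```python
from collections import defaultdict

def count_overlaps(a_regions, b_regions):
    # naive O(n*m) analogue of `bedtools intersect -a A -b B | wc -l`:
    # counts A/B interval pairs that overlap on the same chromosome
    by_chrom = defaultdict(list)
    for chrom, start, end in b_regions:
        by_chrom[chrom].append((start, end))
    n = 0
    for chrom, start, end in a_regions:
        for b_start, b_end in by_chrom[chrom]:
            if start < b_end and b_start < end:
                n += 1
    return n

a = [('chr1', 100, 200), ('chr2', 50, 80)]
b = [('chr1', 150, 300), ('chr1', 400, 500), ('chr2', 10, 60)]
print(count_overlaps(a, b))  # 2 overlapping pairs
```

For the genome-scale BED files used here an interval tree or a sorted sweep would be preferable; this sketch is only meant to make explicit what is being counted.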
We can also do something more intelligent and compute the empirical histogram of intersects with randomly constructed non-gene regions. In other words, we are going to create the notgene_coords.txt file multiple times and run bedtools to count the number of intersects with the Neanderthal introgressed regions from Vernot and Akey, Science 2016; in this way we obtain an empirical histogram. The goal is to show that the observed number of intersects, 178302, is extreme, i.e. located in the 5% tail of the histogram.
import pandas as pd
chr_sizes = pd.read_csv("hg19.fa.gz.fai", header = None, sep = "\t")
chr_sizes.head()
Path = '/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/genes/'
gene_coords = pd.read_csv(Path + 'gene_coords.txt', header=None, sep="\t")
gene_lengths = gene_coords.iloc[:, 2]-gene_coords.iloc[:, 1]
gene_coords.head()
import os
import numpy as np
import subprocess
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/')
perm_n = []  # accumulates intersect counts across permutations; comment out to keep appending across reruns
for k in range(100):
chr_list = []
start_list = []
end_list = []
    for i in range(gene_coords.shape[0]):
        chr_df = gene_coords[gene_coords[0] == gene_coords[0][i]]
        overlap = True
        while overlap == True:
            reg_start = np.random.randint(1, int(chr_sizes[chr_sizes[0] == gene_coords[0][i]].iloc[0, 1]))
            reg_end = reg_start + gene_lengths[i]
            # assume no overlap, then check the candidate region against every gene on this chromosome
            overlap = False
            for j in chr_df.index:
                b1 = chr_df[1][j]
                b2 = chr_df[2][j]
                if (reg_start > b1 and reg_start < b2) or (reg_end > b1 and reg_end < b2):
                    overlap = True
                    break
chr_list.append(gene_coords[0][i])
start_list.append(reg_start)
end_list.append(reg_end)
notgene_rand_coords = pd.DataFrame({'0': chr_list, '1': start_list, '2': end_list})
notgene_rand_coords.to_csv("temp.txt", index = False, header = False, sep = "\t")
akey_path = '/home/nikolay/WABI/Misc/AncientDNA/NeandIntrogression/'
with open('n.txt', 'w') as fp:
subprocess.run(['bedtools', 'intersect', '-a', 'temp.txt', '-b',
akey_path + 'Akey_intr_coords.bed'], stdout = fp)
akey_n = pd.read_csv('n.txt', header=None, sep="\t")
print(k, akey_n.shape[0])
perm_n.append(akey_n.shape[0])
%%bash
Path=/home/nikolay/WABI/Misc/AncientDNA/NeandIntrogression
bedtools intersect -a temp.txt -b $Path/Akey_intr_coords.bed | wc -l
len(perm_n)
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(20,15))
plt.xlim([170000,200000])
plt.axvline(x=178302,linewidth=4, color='r')
sns.distplot(perm_n)
plt.title("Distribution of Non-Gene Intersects: Vernot and Akey, Science 2016", fontsize = 20)
plt.xlabel("Number of Intersects Between Neanderthal Introgressed and Non-Gene Regions", fontsize = 20)
plt.ylabel("Frequency", fontsize = 20)
plt.show()
The red vertical line here is the number of intersects between the Neanderthal introgressed and gene regions. The blue histogram represents the numbers of intersects between the Neanderthal introgressed and intergenic regions, where the random intergenic regions were drawn multiple times to have exactly the same lengths and number as the gene regions. What we see from this figure is that the Neanderthal introgressed regions are significantly enriched in intergenic rather than gene regions. In other words, for some unknown fitness cost, evolution appears to have purged Neanderthal sequences from the genes of modern humans.
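To quantify how extreme the observed count is, the tail probability can be estimated directly from the permutation counts. A minimal sketch, assuming `perm_n` holds the permutation counts (the example numbers below are hypothetical stand-ins):

```python
import numpy as np

def empirical_p(observed, perm_counts, alternative='less'):
    # one-sided empirical p-value with the standard +1 correction
    perm_counts = np.asarray(perm_counts)
    if alternative == 'less':
        hits = np.sum(perm_counts <= observed)
    else:
        hits = np.sum(perm_counts >= observed)
    return (hits + 1) / (len(perm_counts) + 1)

# hypothetical permutation counts around 185000; the observed 178302 falls below all of them
p = empirical_p(178302, [185000 + 1000 * i for i in range(100)])
print(p)  # 1/101, i.e. about 0.0099
```

With the real `perm_n` from the permutation loop, `empirical_p(178302, perm_n)` gives the left-tail probability corresponding to the vertical line in the figure.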
Those were the Neanderthal introgression coordinates from Vernot and Akey, Science 2016. Let us now use the Neanderthal introgression coordinates from David Reich and colleagues, Nature 2016, and check whether we observe the same type of enrichment. Again, the hypothesis is that the number of intersects between Reich's Neanderthal introgression regions and gene regions, 72394, is significantly lower than the typical number of intersects with intergenic segments.
import os
import numpy as np
import subprocess
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/')
perm_n_reich = []  # accumulates intersect counts across permutations; comment out to keep appending across reruns
for k in range(160):
chr_list = []
start_list = []
end_list = []
    for i in range(gene_coords.shape[0]):
        chr_df = gene_coords[gene_coords[0] == gene_coords[0][i]]
        overlap = True
        while overlap == True:
            reg_start = np.random.randint(1, int(chr_sizes[chr_sizes[0] == gene_coords[0][i]].iloc[0, 1]))
            reg_end = reg_start + gene_lengths[i]
            # assume no overlap, then check the candidate region against every gene on this chromosome
            overlap = False
            for j in chr_df.index:
                b1 = chr_df[1][j]
                b2 = chr_df[2][j]
                if (reg_start > b1 and reg_start < b2) or (reg_end > b1 and reg_end < b2):
                    overlap = True
                    break
chr_list.append(gene_coords[0][i])
start_list.append(reg_start)
end_list.append(reg_end)
notgene_rand_coords = pd.DataFrame({'0': chr_list, '1': start_list, '2': end_list})
notgene_rand_coords.to_csv("temp.txt", index = False, header = False, sep = "\t")
reich_path = '/home/nikolay/WABI/Misc/AncientDNA/NeandIntrogression/'
with open('n.txt', 'w') as fp:
subprocess.run(['bedtools', 'intersect', '-a', 'temp.txt', '-b',
reich_path + 'Reich_intr_coords_high_conf.bed'], stdout = fp)
reich_n = pd.read_csv('n.txt', header=None, sep="\t")
print(k, reich_n.shape[0])
perm_n_reich.append(reich_n.shape[0])
%%bash
Path=/home/nikolay/WABI/Misc/AncientDNA/NeandIntrogression
bedtools intersect -a temp.txt -b $Path/Reich_intr_coords_high_conf.bed | wc -l
len(perm_n_reich)
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(20,15))
plt.xlim([69000,78000])
plt.axvline(x = 72394,linewidth = 4, color = 'r')
sns.distplot(perm_n_reich)
plt.title("Distribution of Non-Gene Intersects: Reich, Nature 2016", fontsize = 20)
plt.xlabel("Number of Intersects Between Neanderthal Introgressed and Non-Gene Regions", fontsize = 20)
plt.ylabel("Frequency", fontsize = 20)
plt.show()
We can see that the Reich introgression segments do not really appear to be enriched in the intergenic regions. This is strange; perhaps one could increase the confidence threshold for Neanderthal introgression from 90% to 99%. For now, we will continue with Akey's segments.
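Raising the confidence threshold could be sketched as a simple probability filter. This is a hypothetical example: the column layout and the `prob` column name are assumed for illustration, not taken from the actual Reich file:

```python
import pandas as pd

# hypothetical Reich-style candidate table; column names and values are made up
reich = pd.DataFrame({'chrom': ['chr1', 'chr1', 'chr2'],
                      'start': [100, 500, 30],
                      'end':   [200, 700, 90],
                      'prob':  [0.95, 0.60, 0.99]})

# keep only intervals called with probability >= 99%
high_conf = reich[reich['prob'] >= 0.99]
high_conf[['chrom', 'start', 'end']].to_csv('Reich_intr_coords_higher_conf.bed',
                                            sep='\t', header=False, index=False)
```

The resulting 3-column BED file can then be fed to `bedtools intersect` exactly like `Reich_intr_coords_high_conf.bed` above.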
import pandas as pd
Path = '/home/nikolay/WABI/Misc/AncientDNA/NeandIntrogression/'
intr_coords = pd.read_csv(Path + 'Akey_intr_coords.bed', header = None, sep = "\t")
intr_coords.head()
intr_coords.shape
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(20,15))
intr_lengths = intr_coords.iloc[:, 2]-intr_coords.iloc[:, 1]
sns.distplot(intr_lengths)
plt.title("Distribution of Lengths of Neandertal Introgressed Regions", fontsize = 20)
plt.xlabel("Lengths of Neandertal Introgressed Regions", fontsize = 20)
plt.ylabel("Frequency", fontsize = 20)
plt.show()
from scipy import stats
print(stats.describe(intr_lengths))
import os
import subprocess
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/')
a = 0
with open('hg19_introgr_regions.fa', 'a') as fp:
for i in range(intr_coords.shape[0]):
coord = str(str(intr_coords.iloc[i, 0]) + ':'
+ str(intr_coords.iloc[i, 1]) + '-' + str(intr_coords.iloc[i, 2]))
subprocess.run(['samtools', 'faidx', 'hg19.fa.gz', str(coord)], stdout = fp)
a = a + 1
if a%10000 == 0:
print('Finished ' + str(a) + ' Neanderthal introgressed haplotypes')
chr_sizes = pd.read_csv("hg19.fa.gz.fai", header=None, sep="\t")
chr_sizes.head()
chr_list = []
start_list = []
end_list = []
for i in range(intr_coords.shape[0]):
    chr_df = intr_coords[intr_coords[0] == intr_coords[0][i]]
    overlap = True
    while overlap == True:
        reg_start = np.random.randint(1, int(chr_sizes[chr_sizes[0] == intr_coords[0][i]].iloc[0, 1]))
        reg_end = reg_start + intr_lengths[i]
        # assume no overlap, then check the candidate region against every introgressed region on this chromosome
        overlap = False
        for j in chr_df.index:
            b1 = chr_df[1][j]
            b2 = chr_df[2][j]
            if (reg_start > b1 and reg_start < b2) or (reg_end > b1 and reg_end < b2):
                overlap = True
                break
chr_list.append(intr_coords[0][i])
start_list.append(reg_start)
end_list.append(reg_end)
depl_coords = pd.DataFrame({'0': chr_list, '1': start_list, '2': end_list})
depl_coords.head()
import os
import subprocess
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/')
with open('hg19_depl_regions.fa', 'a') as fp:
for i in range(depl_coords.shape[0]):
coord = str(str(depl_coords.iloc[i, 0]) + ':'
+ str(depl_coords.iloc[i, 1]) + '-' + str(depl_coords.iloc[i, 2]))
subprocess.run(['samtools', 'faidx', 'hg19.fa.gz', str(coord)], stdout = fp)
!grep -c N hg19_introgr_regions.fa
!grep -c N hg19_depl_regions.fa
from Bio import SeqIO
i = 0
for record in SeqIO.parse('hg19_introgr_regions.fa', 'fasta'):
upper_record = record.seq.upper()
if 'N' in upper_record:
i = i + 1
print(i)
i = 0
for record in SeqIO.parse('hg19_depl_regions.fa', 'fasta'):
upper_record = record.seq.upper()
if 'N' in upper_record:
i = i + 1
print(i)
from Bio import SeqIO
intr_file = 'hg19_introgr_regions.fa'
depl_file = 'hg19_depl_regions.fa'
a = 0
i = 0
with open('hg19_introgr_clean.fa', 'a') as intr_out, open('hg19_depl_clean.fa', 'a') as depl_out:
for intr, depl in zip(SeqIO.parse(intr_file, 'fasta'), SeqIO.parse(depl_file, 'fasta')):
upper_intr = intr.seq.upper()
upper_depl = depl.seq.upper()
a = a + 1
if a%1000 == 0:
print('Finished ' + str(a) + ' entries')
if 'N' not in str(upper_intr) and 'N' not in str(upper_depl):
intr.seq = upper_intr
SeqIO.write(intr, intr_out, 'fasta')
depl.seq = upper_depl
SeqIO.write(depl, depl_out, 'fasta')
i = i + 1
else:
continue
print('We have processed ' + str(a) + ' entries and written ' + str(i) + ' entries to two fasta-files')
!grep -c N hg19_introgr_clean.fa
!grep -c N hg19_depl_clean.fa
Now everything is ready for Neanderthal introgressed vs. depleted region classification. We will read the two clean fasta-files with the sequences of introgressed and depleted regions, convert them to a one-hot encoded representation, and run a Convolutional Neural Network (CNN) classifier.
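The one-hot encoding step can also be sketched without scikit-learn; a minimal example with a fixed A/C/G/T column order (the cells below use `LabelEncoder` and `OneHotEncoder` instead):

```python
import numpy as np

def one_hot_dna(seq, alphabet='ACGT'):
    # encode a DNA string as a (len(seq), 4) one-hot matrix with fixed columns A, C, G, T
    lookup = {base: i for i, base in enumerate(alphabet)}
    encoded = np.zeros((len(seq), len(alphabet)), dtype=np.float32)
    for pos, base in enumerate(seq):
        encoded[pos, lookup[base]] = 1.0
    return encoded

x = one_hot_dna('ACGT')  # rows are positions, columns correspond to A, C, G, T
```

A fixed alphabet sidesteps the per-sequence `LabelEncoder` fit used below, which only yields consistent columns because sequences lacking any of the four bases were filtered out earlier.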
import os
from Bio import SeqIO
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/')
intr_file = 'hg19_introgr_clean.fa'
depl_file = 'hg19_depl_clean.fa'
a = 0
intr_seqs = []
depl_seqs = []
for intr, depl in zip(SeqIO.parse(intr_file, 'fasta'), SeqIO.parse(depl_file, 'fasta')):
#intr_seqs.append(str(intr.seq)[0:np.min(depl_lengths)])
#depl_seqs.append(str(depl.seq)[0:np.min(depl_lengths)])
s_intr = str(intr.seq)[0:100]
s_depl = str(depl.seq)[0:100]
if s_intr.count('A')>0 and s_intr.count('C')>0 and s_intr.count('G')>0 and s_intr.count('T')>0 and \
s_depl.count('A')>0 and s_depl.count('C')>0 and s_depl.count('G')>0 and s_depl.count('T')>0:
intr_seqs.append(s_intr)
depl_seqs.append(s_depl)
a = a + 1
if a%10000 == 0:
print('Finished ' + str(a) + ' entries')
sequences = intr_seqs + depl_seqs
len(sequences)
import numpy as np
labels = list(np.ones(len(intr_seqs))) + list(np.zeros(len(depl_seqs)))
len(labels)
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import warnings
warnings.filterwarnings('ignore')
# The LabelEncoder encodes a sequence of bases as a sequence of integers: 0, 1, 2 and 3
integer_encoder = LabelEncoder()
# The OneHotEncoder converts an array of integers to a sparse matrix where
# each column corresponds to one possible integer value, i.e. only 0s and 1s are present in the matrix
one_hot_encoder = OneHotEncoder()
input_features = []
for sequence in sequences:
integer_encoded = integer_encoder.fit_transform(list(sequence))
integer_encoded = np.array(integer_encoded).reshape(-1, 1)
one_hot_encoded = one_hot_encoder.fit_transform(integer_encoded)
input_features.append(one_hot_encoded.toarray())
np.set_printoptions(threshold = 40)
#print(input_features.shape)
input_features = np.stack(input_features)
print("Example sequence\n-----------------------")
print('DNA Sequence #1:\n',sequences[0][:10],'...',sequences[0][-10:])
print('One hot encoding of Sequence #1:\n',input_features[0].T)
input_features.shape
one_hot_encoder = OneHotEncoder()
labels = np.array(labels).reshape(-1, 1)
input_labels = one_hot_encoder.fit_transform(labels).toarray()
print('Labels:\n',labels.T)
print('One-hot encoded labels:\n',input_labels.T)
from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split(
input_features, input_labels, test_size = 0.2, random_state = 42)
train_features.shape
train_labels.shape
test_features.shape
test_labels.shape
from keras.optimizers import SGD, Adam, Adadelta
from keras.layers import Conv1D, Dense, MaxPooling1D, Flatten, Dropout, Embedding, Activation
from keras.models import Sequential
from keras.regularizers import l1
import warnings
warnings.filterwarnings('ignore')
model = Sequential()
model.add(Conv1D(filters = 64, kernel_size = 5, padding = 'same', kernel_initializer = 'he_uniform',
input_shape = (train_features.shape[1], 4)))
model.add(Activation("relu"))
model.add(Conv1D(filters = 64, kernel_size = 5, padding = 'same', kernel_initializer = 'he_uniform'))
model.add(Activation("relu"))
model.add(MaxPooling1D(pool_size = 2))
model.add(Dropout(0.1))
model.add(Flatten())
model.add(Dense(16, kernel_initializer = 'he_uniform'))
model.add(Activation("relu"))
model.add(Dropout(0.2))
model.add(Dense(2, activation = 'softmax'))
epochs = 100
lrate = 0.01
decay = lrate / epochs
sgd = SGD(lr = lrate, momentum = 0.9, decay = decay, nesterov = False)
model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['binary_accuracy'])
#model.compile(loss='binary_crossentropy', optimizer=Adam(lr = lrate), metrics=['binary_accuracy'])
model.summary()
import warnings
warnings.filterwarnings('ignore')
history = model.fit(train_features, train_labels,
epochs = epochs, verbose = 1, validation_split = 0.2, batch_size = 128, shuffle = True)
import matplotlib.pyplot as plt
plt.figure(figsize=(20,15))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss', fontsize = 20)
plt.ylabel('Loss', fontsize = 20)
plt.xlabel('Epoch', fontsize = 20)
plt.legend(['Train', 'Validation'], fontsize = 20)
plt.show()
plt.figure(figsize=(20,15))
plt.plot(history.history['binary_accuracy'])
plt.plot(history.history['val_binary_accuracy'])
plt.title('Model Accuracy', fontsize = 20)
plt.ylabel('Accuracy', fontsize = 20)
plt.xlabel('Epoch', fontsize = 20)
plt.legend(['Train', 'Validation'], fontsize = 20)
plt.show()
from sklearn.metrics import confusion_matrix
import itertools
plt.figure(figsize=(15,10))
predicted_labels = model.predict(np.stack(test_features))
cm = confusion_matrix(np.argmax(test_labels, axis=1),
np.argmax(predicted_labels, axis=1))
print('Confusion matrix:\n',cm)
cm = cm.astype('float') / cm.sum(axis = 1)[:, np.newaxis]
plt.imshow(cm, cmap = plt.cm.Blues)
plt.title('Normalized confusion matrix', fontsize = 20)
plt.colorbar()
plt.xlabel('Predicted label', fontsize = 20)
plt.ylabel('True label', fontsize = 20)
#plt.xticks([0, 1]); plt.yticks([0, 1])
#plt.grid('off')
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
plt.text(j, i, format(cm[i, j], '.2f'),
horizontalalignment = 'center', verticalalignment = 'center', fontsize = 20,
color='white' if cm[i, j] > 0.5 else 'black')
plt.show()
scores = model.evaluate(test_features, test_labels, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
The accuracy looks better now but is still not fantastic. Perhaps some tricks from Natural Language Processing (NLP), such as word embeddings, might help to increase the classification accuracy.
Here we will try to include Word Embedding layers in the Convolutional Neural Network (CNN) in order to perform dimensionality reduction, increase the classification accuracy, and visualize the low-dimensional word embeddings.
from numpy import array
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.embeddings import Embedding
docs = ['well done', 'good work', 'great effort', 'nice work', 'excellent', 'weak', 'poor effort',
'not good', 'poor work', 'could have done better']
labels = array([1,1,1,1,1,0,0,0,0,0])
sentences = [d.split(' ') for d in docs]
print(sentences)
import warnings
warnings.filterwarnings('ignore')
from gensim.models import Word2Vec
model = Word2Vec(sentences, min_count = 1)
print(model)
words = list(model.wv.vocab)
print(words)
print(model['work'])
my_labels = list(np.ones(8)) + list(np.zeros(6))
from sklearn.decomposition import PCA
from umap import UMAP
import matplotlib.pyplot as plt
X = model[model.wv.vocab]
pca = PCA(n_components=2)
result = pca.fit_transform(X)
plt.figure(figsize=(20,15))
plt.scatter(result[:, 0], result[:, 1], c = my_labels, s = 100, cmap = 'tab10')
words = list(model.wv.vocab)
for i, word in enumerate(words):
#plt.annotate(word, xy=(result[i, 0], result[i, 1]), fontsize = 20)
plt.text(result[i, 0], result[i, 1], word, fontsize = 20, c = 'red')
plt.show()
from umap import UMAP
import matplotlib.pyplot as plt
X = model[model.wv.vocab]
umap_model = UMAP(n_neighbors = 5, min_dist = 0.5, n_components = 2)
umap = umap_model.fit_transform(X, y = my_labels)
plt.figure(figsize=(20,15))
plt.scatter(umap[:, 0], umap[:, 1], c = my_labels, s = 100, cmap = 'tab10')
plt.title('UMAP', fontsize = 20)
plt.xlabel("UMAP1", fontsize = 20)
plt.ylabel("UMAP2", fontsize = 20)
words = list(model.wv.vocab)
for i, word in enumerate(words):
plt.text(umap[i, 0], umap[i, 1], word, fontsize = 20, c = 'red')
plt.show()
from keras.preprocessing.text import Tokenizer
# create the tokenizer
t = Tokenizer()
# fit the tokenizer on the documents
t.fit_on_texts(docs)
# summarize what was learned
print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)
# integer encode documents
encoded_docs = t.texts_to_matrix(docs, mode = 'count')
print(encoded_docs)
encoded_docs.shape
encoded_docs[9]
from numpy import array
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.embeddings import Embedding
docs = ['Well done!',
'Good work',
'Great effort',
'nice work',
'Excellent!',
'Weak',
'Poor effort!',
'not good',
'poor work',
'Could have done better.']
# define class labels
labels = array([1,1,1,1,1,0,0,0,0,0])
# integer encode the documents
vocab_size = 50
encoded_docs = [one_hot(d, vocab_size) for d in docs]
print(encoded_docs)
# pad documents to a max length of 4 words
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)
padded_docs = t.texts_to_matrix(docs, mode = 'count')
print(padded_docs)
print(padded_docs.shape)
max_length = padded_docs.shape[1]
vocab_size = padded_docs.shape[1]
# define the model
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# summarize the model
print(model.summary())
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=2)
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))
Now we will do a simple sentiment analysis using the IMDB reviews. The Movie Review Data is a collection of movie reviews retrieved from the imdb.com website in the early 2000s by Bo Pang and Lillian Lee. The reviews were collected and made available as part of their research on natural language processing. The dataset comprises 1,000 positive and 1,000 negative movie reviews drawn from an archive of the rec.arts.movies.reviews newsgroup hosted at imdb.com.
from string import punctuation
from os import listdir
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D
import os
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/')
from nltk.corpus import stopwords
import string
# load doc into memory
def load_doc(filename):
# open the file as read only
file = open(filename, 'r')
# read all text
text = file.read()
# close the file
file.close()
return text
# turn a doc into clean tokens
def clean_doc(doc):
# split into tokens by white space
tokens = doc.split()
# remove punctuation from each token
table = str.maketrans('', '', string.punctuation)
tokens = [w.translate(table) for w in tokens]
# remove remaining tokens that are not alphabetic
tokens = [word for word in tokens if word.isalpha()]
# filter out stop words
stop_words = set(stopwords.words('english'))
tokens = [w for w in tokens if not w in stop_words]
# filter out short tokens
tokens = [word for word in tokens if len(word) > 1]
return tokens
# load the document
filename = 'txt_sentoken/pos/cv000_29590.txt'
text = load_doc(filename)
tokens = clean_doc(text)
print(tokens[0:20])
stop_words = list(set(stopwords.words('english')))
stop_words[0:5]
table = str.maketrans('', '', string.punctuation)
print(string.punctuation)
'book,#$'.translate(table)
from string import punctuation
from os import listdir
from collections import Counter
from nltk.corpus import stopwords
# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
# load doc
doc = load_doc(filename)
# clean doc
tokens = clean_doc(doc)
# update counts
vocab.update(tokens)
# load all docs in a directory
def process_docs(directory, vocab, is_train):
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip any reviews in the test set
        if is_train and filename.startswith('cv9'):
            continue
        if not is_train and not filename.startswith('cv9'):
            continue
# create the full path of the file to open
path = directory + '/' + filename
# add doc to vocab
add_doc_to_vocab(path, vocab)
# define vocab
vocab = Counter()
# add all docs to vocab
process_docs('txt_sentoken/neg', vocab, True)
process_docs('txt_sentoken/pos', vocab, True)
# print the size of the vocab
print(len(vocab))
# print the top words in the vocab
print(vocab.most_common(50))
# keep tokens with a minimum occurrence
min_occurrence = 2
tokens = [k for k,c in vocab.items() if c >= min_occurrence]
print(len(tokens))
# save list to file
def save_list(lines, filename):
# convert lines to a single blob of text
data = '\n'.join(lines)
# open file
file = open(filename, 'w')
# write text
file.write(data)
# close file
file.close()
# save tokens to a vocabulary file
save_list(tokens, 'vocab.txt')
from string import punctuation
from os import listdir
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D
# load doc into memory
def load_doc(filename):
# open the file as read only
file = open(filename, 'r')
# read all text
text = file.read()
# close the file
file.close()
return text
# turn a doc into clean tokens
def clean_doc(doc, vocab):
# split into tokens by white space
tokens = doc.split()
# remove punctuation from each token
table = str.maketrans('', '', punctuation)
tokens = [w.translate(table) for w in tokens]
# filter out tokens not in vocab
tokens = [w for w in tokens if w in vocab]
tokens = ' '.join(tokens)
return tokens
# load all docs in a directory
def process_docs(directory, vocab, is_train):
    documents = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip any reviews in the test set
        if is_train and filename.startswith('cv9'):
            continue
        if not is_train and not filename.startswith('cv9'):
            continue
# create the full path of the file to open
path = directory + '/' + filename
# load the doc
doc = load_doc(path)
# clean doc
tokens = clean_doc(doc, vocab)
# add to list
documents.append(tokens)
return documents
# load the vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)
# load all training reviews
positive_docs = process_docs('txt_sentoken/pos', vocab, True)
negative_docs = process_docs('txt_sentoken/neg', vocab, True)
train_docs = negative_docs + positive_docs
# create the tokenizer
tokenizer = Tokenizer()
# fit the tokenizer on the documents
tokenizer.fit_on_texts(train_docs)
# sequence encode
encoded_docs = tokenizer.texts_to_sequences(train_docs)
# pad sequences
max_length = max([len(s.split()) for s in train_docs])
Xtrain = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define training labels
ytrain = array([0 for _ in range(900)] + [1 for _ in range(900)])
len(encoded_docs[0])
max_length
Xtrain.shape
# load all test reviews
positive_docs = process_docs('txt_sentoken/pos', vocab, False)
negative_docs = process_docs('txt_sentoken/neg', vocab, False)
test_docs = negative_docs + positive_docs
# sequence encode
encoded_docs = tokenizer.texts_to_sequences(test_docs)
# pad sequences
Xtest = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define test labels
ytest = array([0 for _ in range(100)] + [1 for _ in range(100)])
# define vocabulary size (largest integer value)
vocab_size = len(tokenizer.word_index) + 1
vocab_size
# define model
model = Sequential()
model.add(Embedding(vocab_size, 100, input_length=max_length))
model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(10, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
print(model.summary())
# compile network
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(Xtrain, ytrain, epochs=10, verbose=1)
# evaluate
loss, acc = model.evaluate(Xtest, ytest, verbose=0)
print('Test Accuracy: %f' % (acc*100))
Here we will try to include Word Embedding layers in a Convolutional Neural Network (CNN) in order to perform dimensionality reduction, increase classification accuracy and visualize the low-dimensional word embeddings. We start by reading the sequences of the Neanderthal introgressed and depleted regions and merging them together for later classification.
import os
from Bio import SeqIO
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/')
intr_file = 'hg19_introgr_clean.fa'
depl_file = 'hg19_depl_clean.fa'
e = 0
intr_seqs = []
depl_seqs = []
for intr, depl in zip(SeqIO.parse(intr_file, 'fasta'), SeqIO.parse(depl_file, 'fasta')):
    intr_seqs.append(str(intr.seq)[0:100])
    depl_seqs.append(str(depl.seq)[0:100])
    '''
    step = 100; jump = 1500; a = 0; b = step
    for j in range(5):
        s_intr = str(intr.seq)[a:b]
        s_depl = str(depl.seq)[a:b]
        if s_intr.count('A')>0 and s_intr.count('C')>0 and s_intr.count('G')>0 and s_intr.count('T')>0 and \
           s_depl.count('A')>0 and s_depl.count('C')>0 and s_depl.count('G')>0 and s_depl.count('T')>0:
            intr_seqs.append(s_intr)
            depl_seqs.append(s_depl)
        a = a + jump
        b = a + step
    '''
    e = e + 1
    if e%10000 == 0:
        print('Finished ' + str(e) + ' entries')
sequences = intr_seqs + depl_seqs
len(sequences)
sequences[0:10]
Suppose the sequences in the list above are different texts. It is natural to consider k-mers as the words of those texts. Different sequences can share the same k-mers, just as the same words can occur in different texts. However, some words / k-mers are specific to certain texts / sequences, or their frequency can say something about the topic of the text / the biology of the sequence. Here we are going to split each sequence into k-mers and construct sentences, which are lists of words / k-mers.
def getKmers(sequence, size):
    return [sequence[x:x+size].upper() for x in range(len(sequence) - size + 1)]
print('Building Neanderthal introgressed sequences')
intr_sentences = []
for i in range(len(intr_seqs)):
    intr_sentences.append(getKmers(intr_seqs[i], 5))
print('Building Neanderthal depleted sequences')
depl_sentences = []
for i in range(len(depl_seqs)):
    depl_sentences.append(getKmers(depl_seqs[i], 5))
print('Building merged Neanderthal introgressed and depleted sequences')
sentences = []
for i in range(len(sequences)):
    sentences.append(getKmers(sequences[i], 5))
sentences[0][0:5]
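As a quick sanity check, the k-mer splitting can be verified on a toy sequence (the `getKmers` function from above is reproduced here so the snippet is self-contained):

```python
# Same sliding-window k-mer splitter as used in the text above
def getKmers(sequence, size):
    return [sequence[x:x + size].upper() for x in range(len(sequence) - size + 1)]

toy = 'acgtacg'
kmers = getKmers(toy, 5)
print(kmers)  # ['ACGTA', 'CGTAC', 'GTACG']

# A sequence of length L yields L - k + 1 overlapping k-mers
assert len(kmers) == len(toy) - 5 + 1

# With a 4-letter alphabet there are at most 4**5 = 1024 distinct 5-mers,
# which bounds the vocabulary size of our "molecular texts"
print(4 ** 5)  # 1024
```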
The words / k-mers provide a vocabulary whose size we will determine later. We can also use the Counter class to count the words efficiently, and to display and plot the most common words in the merged Neanderthal introgressed and depleted regions.
from collections import Counter
Counter([item for sublist in sentences for item in sublist]).most_common(10)
import matplotlib.pyplot as plt
plt.figure(figsize=(20,15))
D = dict(Counter([item for sublist in sentences for item in sublist]).most_common(20))
plt.bar(range(len(D)), list(D.values()), align='center')
plt.title('Most Common K-mers', fontsize = 20)
plt.ylabel("Counts", fontsize = 20)
plt.xticks(rotation = 90)
plt.xticks(range(len(D)), list(D.keys()), fontsize = 20)
plt.show()
Now let us display the k-mer / word / token counts for the Neanderthal introgressed regions:
import pandas as pd
intr_counts = dict(Counter([item for sublist in intr_sentences for item in sublist]))
kmers = list(intr_counts.keys())
counts = list(intr_counts.values())
intr_df = pd.DataFrame({'Kmer': kmers, 'Count': counts})
intr_df = intr_df.sort_values(['Count'], ascending = False)
intr_df.head(10)
We can also check the GC content in the Neanderthal introgressed regions:
intr_long_seq = ''.join(intr_seqs)
((intr_long_seq.count('C') + intr_long_seq.count('G')) / len(intr_long_seq))*100
The GC-content of the chopped Neanderthal introgressed sequences is quite different from what we obtain without chopping the sequences:
%%bash
cat hg19_introgr_clean.fa | grep -v ">" | awk 'BEGIN{a=0; c=0; g=0; t=0;} {a+=gsub("A","");
c+=gsub("C",""); g+=gsub("G",""); t+=gsub("T","");} END{print a,c,g,t}'
((1415460271 + 1416141395) / (2131712452 + 1415460271 + 1416141395 + 2138472589))*100
Now, for comparison, we will check the k-mer / word / token counts for the Neanderthal depleted sequences and compute the GC-content for both the chopped and full regions:
import pandas as pd
depl_counts = dict(Counter([item for sublist in depl_sentences for item in sublist]))
kmers = list(depl_counts.keys())
counts = list(depl_counts.values())
depl_df = pd.DataFrame({'Kmer': kmers, 'Count': counts})
depl_df = depl_df.sort_values(['Count'], ascending = False)
depl_df.head(10)
depl_long_seq = ''.join(depl_seqs)
((depl_long_seq.count('C') + depl_long_seq.count('G')) / len(depl_long_seq))*100
%%bash
cat hg19_depl_clean.fa | grep -v ">" | awk 'BEGIN{a=0; c=0; g=0; t=0;} {a+=gsub("A","");
c+=gsub("C",""); g+=gsub("G",""); t+=gsub("T","");} END{print a,c,g,t}'
((1442323861 + 1442967008) / (2106254094 + 1442323861 + 1442967008 + 2110241744))*100
merge_df = pd.merge(intr_df, depl_df, on = 'Kmer')
merge_df['Odds'] = merge_df['Count_y'] / merge_df['Count_x']
sorted_merge_df = merge_df.sort_values(['Odds'], ascending = False)
sorted_merge_df.head()
sorted_merge_df.tail()
What we can see is a difference in GC-content between the Neanderthal depleted and introgressed regions: about 0.5% for the chopped sequences and almost 0.8% for the full regions. The depleted regions have a higher GC-content, suggesting that they are more likely to overlap gene-rich regions.
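The GC-content figures for the full regions can be reproduced directly from the A/C/G/T counts printed by the awk one-liners above:

```python
# Nucleotide counts (A, C, G, T) taken from the awk output shown above
intr = {'A': 2131712452, 'C': 1415460271, 'G': 1416141395, 'T': 2138472589}
depl = {'A': 2106254094, 'C': 1442323861, 'G': 1442967008, 'T': 2110241744}

def gc_percent(counts):
    # GC-content = (C + G) / (A + C + G + T) * 100
    return (counts['C'] + counts['G']) / sum(counts.values()) * 100

gc_intr = gc_percent(intr)
gc_depl = gc_percent(depl)
print(round(gc_intr, 2))            # 39.87
print(round(gc_depl, 2))            # 40.63
print(round(gc_depl - gc_intr, 2))  # 0.76, i.e. "almost 0.8%"
```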
Now we can think of the k-mers originating from the same sequence as building a sentence, text or document (doc). For example, let us display one sentence; we can clearly see the words / k-mers separated by spaces, just as in a regular text.
' '.join(intr_sentences[0])
Now we are going to convert the sentences into word vectors, i.e. quantify the words / k-mers based on their similarity to each other. The number of unique k-mers gives us the vocabulary / dictionary of our molecular / biological texts.
import warnings
warnings.filterwarnings('ignore')
from gensim.models import Word2Vec
model = Word2Vec(sentences, min_count = 2, workers = 4)
print(model)
X = model[model.wv.vocab]
X.shape
Now each word is one observation with 100 coordinates, i.e. the default number of latent variables for word2vec. Next we can use the constructed word vectors to visualize the k-mer space with PCA, UMAP and tSNE.
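Since the word2vec output is just a (vocabulary size × 100) matrix, nearest neighbours in the embedding space can also be found directly with cosine similarity. A minimal numpy sketch on a tiny made-up matrix (in the real analysis the matrix would be `X = model[model.wv.vocab]` from above):

```python
import numpy as np

# Hypothetical 4-word embedding matrix, one row per word; the real X above
# has shape (vocabulary size, 100)
X = np.array([[ 1.0, 0.0],
              [ 0.9, 0.1],
              [ 0.0, 1.0],
              [-1.0, 0.0]])

def most_similar(X, i):
    # Cosine similarity of row i against all rows, with the row itself excluded
    norms = np.linalg.norm(X, axis=1)
    sims = X @ X[i] / (norms * norms[i])
    sims[i] = -np.inf
    return int(np.argmax(sims))

print(most_similar(X, 0))  # 1: row 1 points almost the same way as row 0
```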
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
X = model[model.wv.vocab]
pca = PCA(n_components = 2)
result = pca.fit_transform(X)
plt.figure(figsize=(20,18))
plt.scatter(result[:, 0], result[:, 1], s = 10, cmap = 'tab10')
plt.title('Principal Components Analysis (PCA) of K-mers', fontsize = 20)
plt.xlabel("PC1", fontsize = 20)
plt.ylabel("PC2", fontsize = 20)
words = list(model.wv.vocab)
for i, word in enumerate(words):
    if word == 'CCGCG':
        plt.text(result[i, 0], result[i, 1], word, fontsize = 30, c = 'green')
    elif word == 'CGCGG':
        plt.text(result[i, 0], result[i, 1], word, fontsize = 30, c = 'green')
    elif word == 'TTATA':
        plt.text(result[i, 0], result[i, 1], word, fontsize = 30, c = 'green')
    elif word == 'TATAA':
        plt.text(result[i, 0], result[i, 1], word, fontsize = 30, c = 'green')
    else:
        plt.text(result[i, 0], result[i, 1], word, fontsize = 10, c = 'red')
plt.show()
From the PCA plot it is clear that the left cluster of k-mers is enriched in GC-nucleotides while the bigger cluster contains predominantly AT-rich k-mers.
from umap import UMAP
import matplotlib.pyplot as plt
X = model[model.wv.vocab]
X_reduced = PCA(n_components = 5).fit_transform(X)
umap_model = UMAP(n_neighbors = 30, min_dist = 0.1, n_components = 2)
umap = umap_model.fit_transform(X_reduced)
plt.figure(figsize=(20,15))
plt.scatter(umap[:, 0], umap[:, 1], s = 10, cmap = 'tab10')
plt.title('UMAP', fontsize = 20)
plt.xlabel("UMAP1", fontsize = 20)
plt.ylabel("UMAP2", fontsize = 20)
words = list(model.wv.vocab)
for i, word in enumerate(words):
    if word == 'CCGCG':
        plt.text(umap[i, 0], umap[i, 1], word, fontsize = 30, c = 'green')
    elif word == 'CGCGG':
        plt.text(umap[i, 0], umap[i, 1], word, fontsize = 30, c = 'green')
    elif word == 'TATAA':
        plt.text(umap[i, 0], umap[i, 1], word, fontsize = 30, c = 'green')
    elif word == 'TTATA':
        plt.text(umap[i, 0], umap[i, 1], word, fontsize = 30, c = 'green')
    else:
        plt.text(umap[i, 0], umap[i, 1], word, fontsize = 10, c = 'red')
plt.show()
from sklearn.manifold import TSNE
plt.figure(figsize=(20, 15))
X_reduced = PCA(n_components = 5).fit_transform(X)
tsne_model = TSNE(learning_rate = 500, n_components = 2, random_state = 123, perplexity = 30)
tsne = tsne_model.fit_transform(X_reduced)
plt.scatter(tsne[:, 0], tsne[:, 1], cmap = 'tab10', s = 10)
plt.title('tSNE on PCA', fontsize = 20)
plt.xlabel("tSNE1", fontsize = 20)
plt.ylabel("tSNE2", fontsize = 20)
words = list(model.wv.vocab)
for i, word in enumerate(words):
    if word == 'CCGCG':
        plt.text(tsne[i, 0], tsne[i, 1], word, fontsize = 30, c = 'green')
    elif word == 'CGCGG':
        plt.text(tsne[i, 0], tsne[i, 1], word, fontsize = 30, c = 'green')
    elif word == 'TATAA':
        plt.text(tsne[i, 0], tsne[i, 1], word, fontsize = 30, c = 'green')
    elif word == 'TTATA':
        plt.text(tsne[i, 0], tsne[i, 1], word, fontsize = 30, c = 'green')
    else:
        plt.text(tsne[i, 0], tsne[i, 1], word, fontsize = 10, c = 'red')
plt.show()
From both the UMAP and tSNE plots we again clearly see one cluster of k-mers enriched in GC-nucleotides while the bigger cluster contains predominantly AT-rich k-mers.
Now let us define an important variable, max_length, which will be used in the keras Embedding layer; it indicates how many k-mers each text / sequence contains. Next we will concatenate the k-mers from each sequence into a "sentence"; these sentences build the documents (docs):
max_length = len(sentences[0])
print(max_length)
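The value printed above follows from the sliding-window formula: a sequence of length L contains L - k + 1 overlapping k-mers, so the 100-bp chunks used here give 96 words per sentence.

```python
L, k = 100, 5            # chunk length and k-mer size used above
max_length = L - k + 1   # number of overlapping k-mers per chunk
print(max_length)        # 96
```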
from numpy import array
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.embeddings import Embedding
docs = [' '.join(i) for i in sentences]
print(docs[0],'\n')
print('We have ' + str(len(docs)) + ' texts / sequences')
print('Each text / sequence has ' + str(len(docs[0].split(' '))) + ' words / k-mers')
from string import punctuation
from os import listdir
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D
# create the tokenizer
tokenizer = Tokenizer()
# fit the tokenizer on the documents
tokenizer.fit_on_texts(docs)
# sequence encode
encoded_docs = tokenizer.texts_to_sequences(docs)
# pad sequences
max_length = max([len(s.split()) for s in docs])
print(max_length)
#Xtrain = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(encoded_docs[0])
print(len(encoded_docs[0]))
print(len(encoded_docs))
import numpy as np
my_labels = list(np.ones(len(intr_seqs))) + list(np.zeros(len(depl_seqs)))
len(my_labels)
# integer encode the documents
#vocab_size = 1500
#encoded_docs = [one_hot(d, vocab_size) for d in docs]
#print(encoded_docs[0:10])
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)
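`pad_sequences(..., padding='post')` appends zeros to the right of each encoded sequence until it reaches `maxlen` (here every sequence already has exactly max_length k-mers, so the padding is effectively a no-op, but it matters when chunk lengths vary). A pure-Python sketch of that behaviour (note that, unlike Keras, which truncates from the front by default, this sketch truncates from the end for simplicity):

```python
def pad_post(seqs, maxlen):
    # Right-pad (or truncate) each integer sequence to length maxlen with zeros
    return [s[:maxlen] + [0] * (maxlen - len(s[:maxlen])) for s in seqs]

print(pad_post([[5, 2, 7], [1]], maxlen=4))
# [[5, 2, 7, 0], [1, 0, 0, 0]]
```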
vocab_size = len(tokenizer.word_index) + 1
print(vocab_size)
from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split(
    padded_docs, my_labels, test_size = 0.2, random_state = 42)
train_features.shape
len(train_labels)
test_features.shape
len(test_labels)
from keras.optimizers import SGD, Adam, Adadelta
from keras.layers import Conv1D, Dense, MaxPooling1D, Flatten, Dropout, Embedding, Activation
from keras.models import Sequential
from keras.regularizers import l1
import warnings
warnings.filterwarnings('ignore')
model = Sequential()
model.add(Embedding(vocab_size, 100, input_length = max_length))
model.add(Conv1D(filters = 128, kernel_size = 5, padding = 'same', kernel_initializer = 'he_uniform'))
model.add(Activation("relu"))
model.add(Conv1D(filters = 128, kernel_size = 5, padding = 'same', kernel_initializer = 'he_uniform'))
model.add(Activation("relu"))
model.add(MaxPooling1D(pool_size = 2))
model.add(Dropout(0.5))
#model.add(Conv1D(filters = 128, kernel_size = 5, padding = 'same', kernel_initializer = 'he_uniform'))
#model.add(Activation("relu"))
#model.add(MaxPooling1D(pool_size = 2))
#model.add(Dropout(0.4))
model.add(Flatten())
model.add(Dense(16, kernel_initializer = 'he_uniform'))
model.add(Activation("relu"))
model.add(Dropout(0.5))
model.add(Dense(1, activation = 'sigmoid'))
epochs = 100
lrate = 0.01
decay = lrate / epochs
sgd = SGD(lr = lrate, momentum = 0.9, decay = decay, nesterov = False)
model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['binary_accuracy'])
#model.compile(loss='binary_crossentropy', optimizer=Adam(lr = lrate), metrics=['binary_accuracy'])
model.summary()
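The decay passed to SGD above produces a time-based learning-rate schedule; in legacy Keras the rule is, as I understand the implementation, lr_t = lr / (1 + decay * t), where t counts parameter updates (batches), not epochs. A small sketch of the schedule with the values used above:

```python
lrate = 0.01
epochs = 100
decay = lrate / epochs   # 1e-4, as set in the cell above

def lr_at(t, lr=lrate, decay=decay):
    # Legacy Keras SGD time-based decay: lr_t = lr / (1 + decay * t)
    return lr / (1.0 + decay * t)

print(lr_at(0))                 # 0.01 at the first update
print(round(lr_at(10000), 6))   # halved to ~0.005 after 10000 updates
```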
import warnings
warnings.filterwarnings('ignore')
history = model.fit(train_features, train_labels,
                    epochs = epochs, verbose = 1, validation_split = 0.2, batch_size = 128, shuffle = True)
import matplotlib.pyplot as plt
plt.figure(figsize=(20,15))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss', fontsize = 20)
plt.ylabel('Loss', fontsize = 20)
plt.xlabel('Epoch', fontsize = 20)
plt.legend(['Train', 'Validation'], fontsize = 20)
plt.show()
plt.figure(figsize=(20,15))
plt.plot(history.history['binary_accuracy'])
plt.plot(history.history['val_binary_accuracy'])
plt.title('Model Accuracy', fontsize = 20)
plt.ylabel('Accuracy', fontsize = 20)
plt.xlabel('Epoch', fontsize = 20)
plt.legend(['Train', 'Validation'], fontsize = 20)
plt.show()
scores = model.evaluate(test_features, test_labels, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
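Overall accuracy hides the error structure; a confusion matrix separates false positives (depleted regions called introgressed) from false negatives. A minimal numpy sketch on made-up labels (for the real model one would threshold `model.predict(test_features)` at 0.5 to obtain `y_pred`):

```python
import numpy as np

def binary_confusion(y_true, y_pred):
    # Rows: true class (0 = depleted, 1 = introgressed); columns: predicted class
    m = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        m[int(t), int(p)] += 1
    return m

# Hypothetical labels and thresholded predictions, for illustration only
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])
print(binary_confusion(y_true, y_pred))
# [[2 1]
#  [1 2]]
```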
import keras.backend as K
def get_layer_output_grad(model, inputs, outputs, layer=-1):
    """Gets the gradient of a layer's output for given inputs and outputs"""
    grads = model.optimizer.get_gradients(model.total_loss, model.layers[layer].output)
    symb_inputs = (model._feed_inputs + model._feed_targets + model._feed_sample_weights)
    f = K.function(symb_inputs, grads)
    x, y, sample_weight = model._standardize_user_data(inputs, outputs)
    output_grad = f(x + y + sample_weight)
    return output_grad
import numpy as np
import keras
from keras import backend as K
#model = keras.Sequential()
#model.add(keras.layers.Dense(20, input_shape = (10, )))
#model.add(keras.layers.Dense(5))
#model.compile('adam', 'mse')
dummy_in = np.ones((4, train_features.shape[1]))
#dummy_in = train_features[0].reshape(1,-1)
dummy_out = np.ones((4, 1))
#dummy_out = np.ones((1, 1))
#dummy_in = np.ones((4, 10))
#dummy_out = np.ones((4, 5))
dummy_loss = model.train_on_batch(dummy_in, dummy_out)
output_grad = get_layer_output_grad(model, dummy_in, dummy_out)
print(dummy_in)
print(dummy_out)
print(output_grad)
dummy_in.shape
dummy_out.shape
np.array(output_grad).shape